Decentralized Management of Bi-modal Network Resources 
in a Distributed Stream Processing Platform 
Abstract



This paper presents resource management techniques 
for allocating communication and computational re- 
sources in a distributed stream processing platform. The 
platform is designed to exploit the synergy of two classes of 
network connections - dedicated and opportunistic. Pre- 
vious studies we conducted have demonstrated the bene- 
fits of such bi-modal resource organization that combines 
small pools of dedicated computers with a very large pool 
of opportunistic computing capacities of idle computers to 
serve high throughput computing applications. This pa- 
per extends the idea of bi-modal resource organization 
into the management of communication resources. Since
distributed stream processing applications demand a large
volume of data transmission between processing sites at
a consistent rate, adequate control over the network
resources is important to assure a steady flow of processing.
The system model used in this paper is a platform where 
stream processing servers at distributed sites are intercon- 
nected with a combination of dedicated and opportunistic 
communication links. Two pertinent resource allocation
problems are analyzed in detail and solved using
decentralized algorithms. One is the mapping of the processing
and the communication tasks of the stream processing
workload on the processing and the communication resources
of the platform. The other is the dynamic re-allocation of
the communication links due to the variations in the
capacity of the opportunistic communication links. The overall
optimization goal of the allocations is higher task throughput
and better utilization of the expensive dedicated links
without deviating much from the timely completion of the tasks.
The algorithms are evaluated through extensive simulation 
with a model based on realistic observations. The results 
demonstrate that the algorithms are able to exploit the syn- 
ergy of bi-modal communication links towards achieving 
the optimization goals. 
1 Introduction

Many applications on the Internet are creating, manip- 
ulating, and consuming data at an astonishing rate. Data 
stream processing is one such class of applications where 
data is streamed through a network of servers that operate 
on the data as they pass through them [1, 2, 3, 4, 5, 6, 7]. 
Depending on the application, data streams can have com- 
plex topologies with multiple sources or multiple sinks. 
Examples of data stream processing tasks are found in 
many areas including distributed databases, sensor net- 
works, and multimedia computing. Some examples in- 
clude: (i) multimedia streams of real-time events that are 
transcoded into different formats [8], (ii) insertion of infor- 
mation tickers into multimedia streams [9], (iii) real-time 
analysis of network monitoring data streams for malicious 
activity detection [10], and (iv) function computation over
data feeds obtained from sensor networks [4].

One of the salient characteristics of this class of appli- 
cations is the simultaneous demand for high-throughput 
computing and communication resources [11]. Huge volumes
of data generated at high rates need to be processed
within real-time constraints. Moreover, various operations
on these data streams are provided by different servers 
at distributed geographic locations [12]. All these factors
demand a scalable and adaptive architecture for a
distributed stream processing platform, where fine-grained
control over processing and network resources is possible.

Earlier works on stream processing engines [13, 14]
resorted to centralized single-server or server-cluster based
solutions where tighter control over available resources is
possible. With the possibility of different processing
services or operations being provided by different providers,
the need for distributed stream processing platforms arose.
Several architectures have been proposed to support such
distributed processing of streams [11, 15, 12, 16]. Due to the
stringent rate requirement for processing and transmission
of data, most researchers have assumed a central resource
controller that can gather the availability status of all
resources and map the requested tasks on them. However,
with the advent of a diverse range of stream processing
services, it is important to allow autonomous providers of
services to collaborate and share their resources. Thus it
is important to develop decentralized resource allocation
schemes, where control is available over local resources
only.

While it is feasible to have dedicated server resources
and precisely allocate them for processing tasks, dedicated
networks over wide-area installations remain costly.
Although it is possible to propagate the data streams through
the distributed servers using the Internet, the lack of
adequate control over end-to-end bandwidth on the Internet
and the stringent rate requirements of the stream
processing applications demand some dedicated network
resources. In fact, recent advances in optical network
technologies such as user-controlled light paths [17, 18] open
the possibility of on-demand provisioning of end-to-end
optical links, with total control of the available bandwidth
exposed to the user application.

In this paper, we explore a novel approach where a 
combination of dedicated and opportunistic communica- 
tion links is used to interconnect the servers. The main fo- 
cus of this paper is to explore how such a hybrid (denoted 
as bi-modal in this paper) network can be best used for 
data stream processing tasks. The hypothesis that drives 
this work is that the combination has a synergistic effect 
that allows better utilization of the dedicated resources, 
and yields higher return on investment. We devised dis- 
tributed algorithms for allocation of these hybrid resources 
to demonstrate the viability of this synergy hypothesis. 

Multiple global objectives such as higher task through- 
put, lower violation of SLA and higher utilization of ded- 
icated resources make the resource management a com- 
plex task, especially when allocation decisions are to be 
taken solely based on the local information available on 
the server nodes. We divided the overall resource man- 
agement process into two steps - initially individual tasks 
are assigned node and link resources through a distributed 
mapping algorithm. Based on actual resource availabil- 
ity, link resources are then periodically re-allocated locally 
among competing tasks towards the global optimization 
objectives. 

This paper extends some of our previous works [19,
20] on bi-modal compute platforms, where a small static
pool of dedicated compute servers was combined with a
large number of opportunistically harvested cheap
processing elements to increase work throughput and
utilization of dedicated resources. Using data stream
processing tasks as a concrete example, this paper demonstrates
the benefit of using bi-modal network infrastructures for 
communication-intensive applications. In particular, this 
paper makes the following contributions to this important 
resource management problem: 



• Show that the bi-modality of the network helps to im- 
prove the utilization of dedicated resources such as 
servers and network links. 

• Show that the bi-modal organization allows the plat- 
form to admit significantly larger workload and 
yield significantly higher throughput without deviat- 
ing much from the service contracts. 

• Show the importance of adaptive scheduling to cope 
with the variability in the capacity of the opportunistic 
network. 

In Section 2 we present the system model for the
distributed stream processing platform and state the
necessary assumptions. Section 3 introduces and characterizes
the two resource management problems pertinent to the
platform: the problem of mapping the tasks to the
resources and the problem of periodically re-allocating the
resources to adapt to the ever-changing behavior of the
opportunistic resources. Section 4 explains our proposed
solution to the mapping problem. Section 5 explains the
algorithm for periodic re-allocation of the communication
resources. The algorithms presented in both Section 4
and Section 5 are local algorithms engineered to gradually
achieve global optimization objectives such as
high throughput and resource utilization. In Section 6, we
evaluate through extensive simulations the extent to which
these global objectives are achieved by the algorithms. We
then conclude with a discussion of related literature in
Section 7.

2 System Model and Assumptions 

2.1 System Model 

In a stream processing task, the data stream originating
from a data-source node progresses through several steps
of processing, termed service components (or services
for short), before being delivered to the data-delivery node.
For example, in video streaming, the service components
may be encoding of video, embedding some real-time
tickers and transcoding the video into different formats.
Although, in very general terms, the data-flow topology could
be an arbitrary graph, in this paper we restrict our study to
simple path topologies.

The distributed stream processing platform consists of
several autonomous server nodes that serve the service
components. A single server may serve multiple services
and a service may be available at multiple servers. Several
pairs of servers establish dedicated point-to-point links
between them to carry the data streams at a controlled
rate. Each server is also connected to the public Internet,
and an end-to-end TCP connection can be established
between any pair of servers through the Internet. However,
with the Internet, the end-to-end bandwidth of the TCP
connections cannot be allocated and the flow rate cannot
be controlled. These connections are thus treated as
opportunistic resources. Both the dedicated and opportunistic
links are assumed to be bi-directional and of symmetric
capacity, for both data-transport and control messaging
purposes. The assumption on the bi-directionality of
data transport is not strictly necessary for such platforms;
it is made for convenience of discourse.

The platform is modeled as an asynchronous message 
passing distributed system, where there is no centralized 
controller to coordinate the resources. The servers have 
knowledge of and can precisely allocate the local resources 
only, i.e. the processing capacity of the node and the band- 
width of the outgoing communication links. However, the 
servers comply with the global protocol and respond to a 
predefined set of messages in a predefined way. The objec- 
tive of the global protocol is to ensure adequate resources 
for each individual task for its seamless progress, and to 
maximize the global work throughput. Other factors such 
as balancing the load among different servers and maxi- 
mizing the utilization of dedicated resources are also con- 
sidered. Design and evaluation of the protocol constitute 
the remaining sections of the paper. 

Figure 1 illustrates a scenario of a stream processing
platform containing five servers. The example stream
processing task shown in the figure requests a data stream
from data source $d_2$ to be processed through services $a_2$,
$a_3$, $a_4$ and $a_5$, and to be delivered to $S_1$. This task may be
served by the servers $S_4$ (serving $d_2$), $S_3$ (serving $a_2$) and $S_2$
(serving $a_3$ and $a_4$). Either a dedicated link or the public
network may be used to transmit the data stream between any
two consecutive servers.

For convenience, the resource allocation process is di- 
vided into two phases. First, individual tasks with multiple 
service components are mapped on the processing servers 
fulfilling the processing and transport capacity require- 
ments. A cost function is used to select the best among 
multiple feasible maps. The second phase re-allocates the
link bandwidths among competing tasks after the tasks
start execution based on the initial allocation. This is
necessary because of the variability of the data rate in the
end-to-end TCP connections on the Internet. Both the initial
allocation and the re-allocation phase are driven by the
same global optimization goal, namely maximization of
global throughput and resource utilization, subject to
fulfillment of individual task requirements.

2.2 Architecture 

The stream processing platform can be viewed as
composed of the layers shown in Figure 2, with user
applications at the top. The applications are composed of data
sources and several service components hosted by different
servers. Therefore, the service components constitute
the next layer. Below them, the resource management
system (RMS) of the platform manages the available
server and network resources to allow seamless execution
of the service components. The main focus of this paper
is to design and analyze the algorithms for the various
functionalities of the RMS layer. The RMS is responsible for
mapping the task requests on available resources and
dynamically adapting the resource allocations in response to
various loading conditions. The two components of the RMS
cooperate to achieve these functionalities. A detailed
discussion of the RMS is presented in Section 3. The RMS uses
the local operating system API to control the underlying
resources. Hence the host OS and physical resources lie at the
bottom of the layered architecture.

2.3 Task Specification 

The specification of the stream processing task includes
the ordered sequence of service components, the data
source node, the data delivery node and the desired rate
of data delivery. We assume a rate-based model to specify
the resource requirement of each service component. For any
service, both the output data rate and the CPU requirement
are proportional to the input data rate, and are specified by
two factors, the bandwidth shrinkage factor and the CPU
usage factor, respectively. We assume that these two
factors are known globally for every service component. Thus
any node receiving the task specification can compute the
CPU and input/output data rate requirements for each
service component. This rate-based model is similar to the
ones used by Kichkaylo et al. [16] and Drougas et al. [11].
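The rate-based model above can be illustrated with a minimal sketch. The
names `Service`, `requirements` and `source_rate_for_delivery`, as well as
the numeric factors, are hypothetical and chosen only for illustration; the
sketch assumes that the output rate and the CPU demand are each obtained by
multiplying the input rate by the corresponding factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    shrinkage: float   # assumed: output rate = shrinkage * input rate
    cpu_factor: float  # assumed: CPU demand  = cpu_factor * input rate


def source_rate_for_delivery(services, delivery_rate):
    """Input rate needed at the source so the sink receives delivery_rate."""
    factor = 1.0
    for s in services:
        factor *= s.shrinkage
    return delivery_rate / factor


def requirements(services, source_rate):
    """Return (name, input_rate, cpu_demand, output_rate) for each service,
    following the ordered sequence of service components."""
    rows = []
    rate = source_rate
    for s in services:
        out = rate * s.shrinkage
        rows.append((s.name, rate, rate * s.cpu_factor, out))
        rate = out  # the output of one service feeds the next
    return rows


if __name__ == "__main__":
    chain = [Service("a2", 0.5, 2.0), Service("a3", 0.5, 1.0)]
    src = source_rate_for_delivery(chain, 100.0)
    for row in requirements(chain, src):
        print(row)
```

Because the factors are assumed to be globally known, any node holding the
task specification can run the same computation and obtain the same
requirements without contacting other nodes.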

The task specification is a service level agreement (SLA)
between the user and the platform. On receiving the
request for resources for a task, the platform attempts to
allocate the necessary resources. The platform may fail
to allocate all necessary resources due to the loading
condition of the platform, and the task may be rejected as a
result. Once the task is accepted after successful resource
allocation, it is the responsibility of the platform to meet the
constraints specified in the SLA.

2.4 Pricing and Revenue Flow 

We assume rate-based pricing for the services. The
task specification includes a price per byte of data
delivered. This price quote is directly translated to apportioned
revenue for each of the service components, using the CPU
usage and bandwidth shrinkage factors. The server that
serves a service component receives revenue for each byte
processed at this apportioned rate. In some cases, a
server may need to forward the data without any
processing, due to the particular task-to-resource mapping chosen.
We assume there is a universally defined price charged by
any server per byte of data forwarded. Because the
data forwarding path from service $i$ to service $i+1$ is chosen
by the server of service $i$, it is assumed that any
forwarding price incurred before reaching the server serving the
$(i+1)$-th service is paid by the previous server.

Figure 1. Stream processing platform
(sample task: $d_2 \to a_2 \to a_3 \to a_4 \to a_5 \to S_1$)

3 Decentralized Management of Server and 
Network Resources 

A resource management engine (denoted as RMS agent)
runs in each server and implements the protocols for
coordinated allocation of network and CPU resources. The
resource management process is divided into two phases:
initial mapping of individual tasks and dynamic
re-allocation of the resources among competing tasks.
Accordingly, each RMS agent has two modules, a map
manager and a dynamic scheduler. This section defines the two
problems in detail and illustrates the global picture that
integrates these two phases for global resource management
objectives. The following two sections discuss the
possible solutions to these problems.

A user of the distributed platform uses one of the server 
nodes as a portal to launch her stream processing task. The 
task specification submitted to the portal contains the ad- 
dress of data stream source and an ordered list of the ser- 
vice components that should process the data stream. By 
default the delivery point (destination) of the stream is the 
user's portal node, but any other node can be specified as 
well. The specification also includes the required rate of 
data delivery, time window for monitoring the rate and 
pricing for each byte of data delivered. The parameters
such as data rate and pricing may be negotiated between
the user and the portal through an automated SLA
negotiation protocol, the details of which are outside the scope
of this paper.

Figure 2. Layered architecture

After receiving the specification from the user, the portal
node initiates the mapping process by sending a map message
with the initial mapping and the requirement specification
to the data-source node. Through message passing
among the map managers in different server nodes, the dis- 
tributed mapping algorithm results in a set of feasible maps 
at the map manager of the data-delivery node. Each of the 
maps defines a path from the data source node to the de- 
livery node through the server nodes that serve necessary 
service components. The best among the available feasible 
maps according to a certain cost metric is selected. We as- 
sume that the cost metric is additive and the cost is incurred 
at every node and link used by the task. 

A reservation probe is then sent from the data-delivery 
node to the data-source node along the path found in the 
selected map. The RMS agent at each server node along 
the path tries to allocate the server and link resources
prescribed by the map. Because the mapping processes for
multiple tasks may be ongoing concurrently, it is possible
that a required resource is no longer available. In
such a case the allocation fails, the probe is rolled back and
the next feasible map is probed by the data-delivery node.
The streaming and the execution of the stream processing 
task begins once a successful probe reaches the data-source 
node at the other end. The message flow of mapping and 
reservation is illustrated in Figure 3. 
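The probe-and-rollback behavior described above can be sketched as follows.
This is a single-process illustration under simplifying assumptions, not the
message-level protocol of the platform: resources are identified by
hypothetical string keys, `free` stands in for the local availability tables
of all nodes on the path, and maps are tried in order of increasing cost.

```python
def reserve_along(path_demands, free):
    """Try to reserve every (resource, amount) pair in probe order.

    On the first shortage, all reservations made so far are released
    (the rollback) and False is returned.
    """
    taken = []
    for res, amount in path_demands:
        if free.get(res, 0) < amount:
            for r, a in reversed(taken):  # roll the probe back
                free[r] += a
            return False
        free[res] -= amount
        taken.append((res, amount))
    return True


def probe_maps(feasible_maps, free):
    """feasible_maps: list of (cost, path_demands) pairs.

    Returns the first map, in order of increasing cost, whose
    reservation succeeds, or None if every probe is rolled back.
    """
    for cost, demands in sorted(feasible_maps, key=lambda m: m[0]):
        if reserve_along(demands, free):
            return cost, demands
    return None


if __name__ == "__main__":
    free = {"S2": 5, "S3": 2, "S5": 10}
    maps = [
        (1, [("S2", 3), ("S3", 4)]),  # cheapest, but S3 is short
        (2, [("S2", 3), ("S5", 1)]),
    ]
    print(probe_maps(maps, free), free)
```

The rollback restores the capacity taken at earlier hops, so a failed probe
leaves no residual reservation that could block concurrent mappings.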

3.1 The Mapping Problem 

Abstracting away the details of the two classes of com- 
munication links and different types of service, the map- 
ping of a stream processing task on the network of servers 
can be described as a problem of constrained mapping of a 
weighted directed path on a weighted undirected graph. 

The network of servers can be defined as a graph $G_R =
(V_R, E_R)$. Each vertex $v_R \in V_R$ denotes a server that
has an available computational capacity $C_{av}(v_R)$. Each
edge $e_R \in E_R$ denotes a data transport link with an
available bandwidth $B_{av}(e_R)$. Each edge $e_R$ also has
an associated additive cost $W(e_R)$. The stream
processing task can be defined as a path $P_J = (V_J, E_J)$, with
$V_J = \{v_0 = s_J, v_1, v_2, \ldots, v_m = t_J\}$ and
$E_J = \{e_i = (v_i, v_{i+1}) \mid 0 \le i < m\}$. Each vertex
$v_i$, $0 \le i \le m$, of the stream processing task has a
computational capacity requirement $C_{req}(v_i)$, and each edge
$e_i = (v_i, v_{i+1})$, $0 \le i < m$, has a bandwidth
requirement $B_{req}(e_i)$.

Figure 3. Mapping, reservation and rollback

Figure 4. Dynamic scheduling of link resources done by server $S_i$
on three competing task segments $a_2$-$a_3$, $b_3$-$b_4$, $c_2$-$c_3$

The problem is to find mappings $M_v : V_J \to V_R$ and
$M_e : E_J \to P_R$, where $P_R$ is the set of all possible
paths in the resource graph, including zero-length paths.
The second mapping $M_e$ is needed because a server node
can act as a forwarding node and thus each edge in $E_J$
can potentially be mapped on a multi-hop path $p_R$ in $G_R$.
Also, multiple vertices from $V_J$ can be mapped on a single
vertex of $V_R$, which essentially maps edges from $E_J$
on zero-length paths, i.e. $(v, v)$ paths with infinite
bandwidth and zero cost. Furthermore, for two different
edges $e_1, e_2 \in E_J$, the mapped paths $p_1 = M_e(e_1)$
and $p_2 = M_e(e_2)$ are allowed to have common edges. The
mapping of the source node and the sink node is already given:
$M_v(s_J) = s_R$ with $s_R \in V_R$, and $M_v(t_J) = t_R$ with $t_R \in V_R$.

The mapping has to fulfill the following constraints on
processing capacity and bandwidth:

$$\forall e_J = (u, v) \in E_J: \quad B_{req}(e_J) \le \min\{ B_{av}(e_R) \mid e_R \in M_e(e_J) \}$$

The constraints define the decision problem: "Is there
any $M_v$ and $M_e$ that satisfy the constraints?". This
problem can be proved to be NP-complete by a reduction
from the longest path problem [21]. The details of the proof
can be found in [22, 23]. When the answer to the decision
problem is yes, there can be multiple feasible mappings
that satisfy the constraints. To choose a single mapping
among the feasible ones, we can formulate a corresponding
optimization problem, where each edge $e_R \in E_R$ in
the resource graph has an additive cost $W(e_R)$. The
objective is to find the feasible mapping that minimizes
the total cost $W = \sum_{(u,v) \in E_J} W(p_R)$, where $p_R = M_e(u, v)$.
The cost $W(p_R)$ of a path $p_R$ is the sum of the costs $W(e_R)$
of all edges $e_R$ in $p_R$.
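As an illustration of the decision and optimization problems, the following
sketch checks a candidate mapping against the capacity and bandwidth
constraints and returns its total cost. The function name, the data layout
and the numbers in the example are hypothetical (they are not the values of
Figure 5). Following the constraint as stated, each task edge is checked
against each resource edge individually.

```python
def check_mapping(c_av, links, c_req, b_req, m_v, m_e):
    """Return the total cost W of a mapping, or None if it is infeasible.

    c_av : server -> available capacity C_av
    links: frozenset({u, v}) -> (available bandwidth B_av, cost W)
    c_req: task vertex -> required capacity C_req
    b_req: task edge index i -> required bandwidth of (v_i, v_{i+1})
    m_v  : task vertex -> server
    m_e  : task edge index -> list of servers on the mapped path
           (a one-element list is a zero-length path)
    """
    # Capacity: sum the requirements of all task vertices on each server.
    load = {}
    for v, server in m_v.items():
        load[server] = load.get(server, 0) + c_req[v]
    if any(load[s] > c_av[s] for s in load):
        return None

    # Bandwidth and cost: walk every hop of every mapped path.
    total = 0
    for i, path in m_e.items():
        for u, v in zip(path, path[1:]):
            link = links.get(frozenset((u, v)))
            if link is None or b_req[i] > link[0]:
                return None
            total += link[1]
    return total


if __name__ == "__main__":
    c_av = {"B": 10, "D": 5, "F": 5}
    links = {frozenset(("B", "D")): (8, 2), frozenset(("D", "F")): (6, 3)}
    c_req = {"s": 0, "x1": 3, "x2": 4, "x3": 5, "t": 0}
    b_req = {0: 5, 1: 5, 2: 5, 3: 5}
    m_v = {"s": "B", "x1": "B", "x2": "B", "x3": "D", "t": "F"}
    m_e = {0: ["B"], 1: ["B"], 2: ["B", "D"], 3: ["D", "F"]}
    print(check_mapping(c_av, links, c_req, b_req, m_v, m_e))
```

Verifying a given mapping is cheap; the hardness of the problem lies in
searching the space of possible mappings.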

Figure 5 shows an example resource network of eight
interconnected computing nodes. The computational capacity of
each node is represented by a number inside the node. The
link bandwidth (B) and cost (d) are marked on each
edge. An example stream processing task of path
topology with one source $s$, one sink $t$ and three computational
nodes $x_1$, $x_2$, $x_3$ is shown in Figure 6, with the node
capacity and bandwidth requirements. $s$ and $t$ must be mapped
on B and F, respectively. There can be many feasible
mappings of this dataflow computation on the resource graph
in Figure 5. One of them is:



$$
\begin{array}{ll}
M_v(s) = B   & M_e(s, x_1) = (B, B) \\
M_v(x_1) = B & M_e(x_1, x_2) = (B, B) \\
M_v(x_2) = B & M_e(x_2, x_3) = (B, D) \\
M_v(x_3) = D & M_e(x_3, t) = (D, F) \\
M_v(t) = F   &
\end{array}
$$

This is also the optimal solution in terms of total end-to-end
cost between the resource nodes $M_v(s)$ and $M_v(t)$.

We developed a decentralized algorithm that finds
exact solutions to the problem. As the problem is
NP-complete, approximation schemes are also proposed.
The algorithm and the approximation schemes are discussed
in Section 4.



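The processing capacity constraint of the mapping is:

$$\forall v_R \in M_v(V_J): \quad \sum_{\{ v_J \mid v_J \in V_J,\ M_v(v_J) = v_R \}} C_{req}(v_J) \le C_{av}(v_R)$$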




Figure 5. An example resource network 

3.2 The Dynamic Re-allocation Problem 

Allocation of the server and link resources by the
mapping process would suffice if all the resources were
dedicated and under total control of the platform. Because
the data rate over the links through the public network is
variable and not under direct control of the platform,
continuous adaptive allocation of the resources is necessary.

To minimize the overhead, it is desirable that the re- 
allocation is done based on local information, otherwise a 
state-dissemination protocol will be necessary. The global 
objective of re-allocation is to maximize global processing 
throughput and keep the data-delivery rate for each task 
as close as possible to the SLA defined delivery rate. Lo- 
cally, each server can monitor the rate at which it processes 
data for each task using one of its services and the rate it 
transmits data to the next service for each task. This in- 
formation can be used to determine how closely the task 
is progressing compared to the SLA defined rate, because 
delivery rate is directly defined by the rate of processing 
by each service component. Maximizing the compliance 
at each server will imply maximum compliance at the de- 
livery point. 

However, it is hard to know the global throughput of
the processed data from each server. We therefore devised
local objectives whose achievement leads towards the
global objective. Recall that each server allocates local
resources autonomously and that each server is paid for
each byte of data it processes. A rational server would be
inclined to maximize its own revenue. We devised the
allocation policy so that it is consistent with this rational
inclination of the servers. The expectation is that maximizing
the local revenue requires maximizing the local work
throughput, which in turn leads to global throughput maximization.



The adaptive reallocation is performed periodically at 
each server node and the period need not be synchronized 
globally. In principle, both the server and the link re- 
sources could be re-allocated. However, in the proposed 
system model, servers are dedicated for stream process- 
ing. A server accepts a task only if the requested amount 
of processing resource is available. Thus, once a task gets 
server resources allocated, its processing rate at that server 
does not vary over time. However, transmission rate over 
the opportunistic network links may vary over time, be- 
cause they are shared resources and not under complete 
control of the platform. Thus, to provide a predictable 
performance guarantee for accepted tasks, it is essential 
to adaptively re-allocate the link resources. On the other 
hand, although re-allocation of the server resources could 
improve load balancing and resource utilization because 
of changing load scenario, it is not possible to re-map the 
task components on new servers locally or based on local 
information, and the global mapping process has a lot of 
overhead. For these reasons, we confined the re-allocation 
within link resources only, leaving the initial allocation of 
server resources unchanged. 

The links that carry the stream between two data
processing servers can be of three different types: (i) a direct
dedicated link, (ii) a multi-hop dedicated link through one
or more forwarding nodes, and (iii) an overlay link through the
public network. A mapping of a task may contain any
combination of these three types of links between the
processing nodes. Among them, direct dedicated links are the
most preferred, because they provide a controlled and
stable data rate. A multi-hop dedicated link provides
similar control and stability, but it costs more (Section 2.4). The
third possibility is an overlay link through the public
network. The flow rate is variable over such links, but
there is no additional cost for sending data through them.
So, the nodes try to opportunistically use these links when
dedicated links are overloaded or not available.

The dynamic link scheduler in each node is invoked pe- 
riodically at regular intervals. Based on current evaluation 
of locally observed performance, the scheduler re-allocates 
the locally available link resources among the competing 
tasks that are using this node. The overall policy of the 
scheduler is to prioritize the tasks based on observed per- 
formance and re-allocate the three possible types of out- 
going links based on the newly estimated priorities. The 
re-allocation process is illustrated in Figure 4. The re- 
allocation algorithm is described in Section 5, including 
the design of the appropriate priority function. 
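The overall policy of the scheduler can be sketched as follows. The priority
function used here (relative shortfall from the SLA rate) is only an
assumption made for illustration; it is not the priority function designed
in Section 5. The sketch also simplifies the link model to two classes, a
local dedicated capacity and the public network, and omits multi-hop
dedicated links.

```python
def reallocate(tasks, dedicated_capacity):
    """One scheduling round at a single node.

    tasks: task name -> (sla_rate, observed_rate), with sla_rate > 0
    dedicated_capacity: bandwidth of the local dedicated link

    Returns task name -> "dedicated" or "public".
    """
    def priority(item):
        sla, observed = item[1]
        # Assumed priority: tasks lagging furthest behind their SLA first.
        return (sla - observed) / sla

    plan = {}
    remaining = dedicated_capacity
    for name, (sla, _observed) in sorted(
        tasks.items(), key=priority, reverse=True
    ):
        if sla <= remaining:
            plan[name] = "dedicated"
            remaining -= sla
        else:
            # No dedicated bandwidth left: use the opportunistic link.
            plan[name] = "public"
    return plan


if __name__ == "__main__":
    observed = {"a": (10.0, 5.0), "b": (10.0, 9.0), "c": (10.0, 10.0)}
    print(reallocate(observed, dedicated_capacity=20.0))
```

Only locally observed rates enter the decision, in line with the requirement
that re-allocation be done without a state-dissemination protocol.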

4 Algorithm for the Mapping Problem 

In this section, we develop a decentralized algorithm for 
the constrained path mapping problem introduced in Sec- 
tion 3.1. The algorithm is then adapted through some ap- 
proximation heuristics and other modifications to use in the 
bi-modal stream processing platform. 

For distributed mapping, we use the scheme presented 
by Chandy and Misra [24], which was based on Dijkstra 
and Scholten's diffusing computation paradigm [25]. The 
centralized version of the problem, i.e. finding the map- 
pings when the global knowledge of the system state is 
available at a single node, can be solved using the Bellman- 
Ford relaxation scheme. Such an algorithm was analyzed 
in one of our previous works [22].

The distributed mapping algorithm uses two kinds of 
messages - i) map message and ii) ack message. The map 
messages propagate the partially computed maps from the 
data-source node to the data-delivery node through the net- 
work. The ack message is used for detecting termination 
of the mapping algorithm, as commonly used in diffusing 
computations. 

For each mapping, some variables are used to maintain 
the state of the diffusing computation - count maintains the 
number of outstanding ack messages to be received against 
the sent map messages, pred maintains the name of the 
predecessor node in the diffusing computation which made 
the current node aware of the mapping by sending a map 
message when count was 0. 

To disseminate the task specification to all the
participating nodes, another type of message, the spec message,
is appended to the map message. To be efficient, when
u sends a map message to v, it is sufficient to append the
spec to the map only when u is the pred for v. To keep
track of this, every node maintains a flag knows(i) for
each neighbor i, and sets the flag when a spec-appended
map message is sent to i. To assist this process, the last
ack sent by a node to its pred when count becomes 0 is
differentiated from a regular ack using a lastAck flag. A
node resets the knows(i) flag when it receives an ack with
the lastAck flag from its neighbor i. To save memory,
each node creates a state for a new mapping process only
when it becomes aware of the process through a map message.
The state is initialized (pred = undefined, count = 0,
knows(i) = false for all i) on creation. The state is
removed when count becomes 0 and the lastAck is sent.
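The bookkeeping described above can be sketched as a small single-process
model. This is an illustrative reading of the text, not the paper's
Algorithms 1 and 4: the class and method names are hypothetical, messages are
collected in an `outbox` list instead of being sent over a network, and the
decision about which neighbors receive the map is passed in as `forward_to`.

```python
class MapState:
    """Per-mapping state kept by one node."""

    def __init__(self, neighbors, pred):
        self.pred = pred    # node whose map message engaged this node
        self.count = 0      # outstanding acks for the maps sent so far
        self.knows = {n: False for n in neighbors}


class Node:
    def __init__(self, name, neighbors):
        self.name = name
        self.neighbors = list(neighbors)
        self.states = {}    # map id -> MapState, created lazily
        self.outbox = []    # (destination, kind, payload) tuples

    def on_map(self, map_id, sender, forward_to):
        st = self.states.get(map_id)
        if st is None:
            # First contact: create the state, sender becomes pred.
            st = self.states[map_id] = MapState(self.neighbors, sender)
        else:
            # Already engaged: acknowledge immediately.
            self.outbox.append((sender, "ack", {"lastAck": False}))
        for n in forward_to:
            # Append the spec only if n has not received it from us.
            self.outbox.append((n, "map", {"spec": not st.knows[n]}))
            st.knows[n] = True
            st.count += 1
        self._finish_if_idle(map_id)

    def on_ack(self, map_id, sender, last_ack):
        st = self.states[map_id]
        st.count -= 1
        if last_ack:
            st.knows[sender] = False  # the neighbor dropped its state
        self._finish_if_idle(map_id)

    def _finish_if_idle(self, map_id):
        st = self.states[map_id]
        if st.count == 0:
            self.outbox.append((st.pred, "ack", {"lastAck": True}))
            del self.states[map_id]   # free the per-mapping state


if __name__ == "__main__":
    a, b = Node("A", ["B"]), Node("B", ["A"])
    a.on_map("m1", "portal", ["B"])
    b.on_map("m1", "A", [])
    a.on_ack("m1", "B", True)
    print(a.outbox, b.outbox)
```

When the acks have propagated back to the initiating node and its own count
reaches 0, the diffusing computation for that mapping has terminated.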

The mapping algorithm works primarily by enumerat- 
ing all feasible mappings on all possible paths. The opti- 
mal mapping is then chosen from the feasible mappings. 
However, feasible mappings are gradually expanded while 
exploring different paths and many of the mappings and 
paths are pruned or discarded once any of the resource
constraints fails. Thus explicit enumeration of all possible
alternatives is avoided.

Each node executes the processMap method (Algo- 
rithm 1) when a map message is received and the pro- 
cessAck method (Algorithm 4) when an ack message is 
received. Each time a node receives a partial map, it tries 
to extend the partial map in all possible ways by appending
the mapping of more task components onto itself,
subject to availability of processing power (Line 15). Each of
these newly generated partial maps is then extended to all
of the neighbors, as long as the bandwidth requirement of
that hop in the task does not exceed the available bandwidth
in that link (Line 28). Note that it is possible to extend the map
to the neighbors without having any component mapped 
on the current node. This allows multi-hop connection be- 
tween nodes processing consecutive components. This is 
beneficial in cases where there is no direct dedicated link 
between two server nodes. All the feasible mappings are 
thus accumulated at the data-delivery node. The acknowl- 
edgement process of the diffusing computation ensures ter- 
mination of the algorithm and allows each node to clear the 
states related to the terminated mapping. 

Cyclic mapping is allowed in the extension in Line 28.
Because x = 0 is allowed, it is possible that a mapping
grows to an infinite length. In practice, this is avoided by
limiting the growth of the multi-hop mapping using a budget
factor. Based on the price-per-byte-processed quoted
in the SLA (Sections 2.3 and 2.4), the allocated revenue
for processing of the j-th service is limited. When the
output of the j-th service is sent to the server providing the
(j + 1)-th service using a dedicated link, the host of the j-th
service needs to pay and thus loses revenue. The cost of
transmission grows as more dedicated links are used in a
multi-hop link to send the same data. Thus the number
of hops in such multi-hop links is limited by the revenue
budgeted for the service and the cost of each hop of dedicated
connection. This maximum hop restriction is summarized
as the max_null parameter (Line 3) in Algorithm 1.
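
As a rough illustration of the budget argument, the following sketch derives a hop limit from a per-byte revenue budget and a per-byte cost of one dedicated hop. The function name, the flat per-hop pricing and the sample figures are assumptions made for illustration only; the actual pricing follows the SLA terms of Sections 2.3 and 2.4.

```python
import math


def max_dedicated_hops(revenue_per_byte: float, cost_per_hop_per_byte: float) -> int:
    """Largest hop count whose total cost stays within the revenue.

    Assumes (hypothetically) that every dedicated hop costs the same.
    """
    if cost_per_hop_per_byte <= 0:
        raise ValueError("per-hop cost must be positive")
    return math.floor(revenue_per_byte / cost_per_hop_per_byte)


# A service earning 10 units per byte with hops costing 3 units per byte
# can afford at most three dedicated hops.
print(max_dedicated_hops(10, 3))
```

A limit obtained this way could play the role of the maximum hop restriction when a map is extended over forwarding-only nodes.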

One point to note here is that the partial mappings cannot 
be pruned using the optimality criterion, i.e. the cost met- 
ric. Even for the same prefix of the task, a lower cost map- 
ping may later get pruned by the resource constraint while 
a higher cost mapping may survive. Thus greedy pruning
of the mappings based on the cost metric may not yield
the optimal solution. However, the analysis in Section 4.1
shows that such greedy pruning dramatically reduces the
number of messages without sacrificing too much of the
optimality.

4.1 Heuristic Approximations 

We observe that, in the worst case, the mapping algo- 
rithm in Algorithm 1 may generate all possible source- 
destination paths in the graph and try all possible combina- 
tions of the task components on each of those paths. Such 
intractably explosive growth of complexity is expected be- 
cause the path mapping problem is NP-complete. For prac- 
tical implementations, it may be desirable to sacrifice some 
degree of optimality in favor of a reduction in complexity.
Here, we explore some heuristic techniques that reduce
the complexity while producing a good approximation
of the optimal solution.



4.1.1 LeastCostMap 

One intuitive way of reducing complexity is to greedily
prune the exploration of many of the alternative paths and
mappings based on the cost metric. In the LeastCostMap
heuristic, a partial mapping that has a higher cost than
a previously observed mapping of the same prefix-length
is pruned from further extension. To support this, each node,
for each task-mapping, maintains a table of the costs of the
least-cost partial mappings for each possible prefix length,
among the already observed partial mappings of the composite
task. The cost of the newly extended mapping in
Line 15 of Algorithm 1 is compared to that in the table, and
the mapping is sent to neighbors in Line 28 only if it has a
smaller cost. The cost in the table is updated accordingly.
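
The table lookup described above can be sketched as follows. This is a minimal illustration rather than the platform's implementation; the class name LeastCostTable and the method admit are hypothetical, and one such table is assumed to exist per node and per task-mapping.

```python
import math


class LeastCostTable:
    """Least cost observed so far for each prefix length of one task."""

    def __init__(self, task_len: int):
        # min_cost[p] is the lowest cost among the partial maps of
        # prefix length p that this node has seen; infinity means none yet.
        self.min_cost = [math.inf] * (task_len + 1)

    def admit(self, prefix_len: int, cost: float) -> bool:
        """Return True if the partial map should be forwarded.

        A map is forwarded only when it is strictly cheaper than the best
        map of the same prefix length seen so far; the table is updated.
        """
        if cost < self.min_cost[prefix_len]:
            self.min_cost[prefix_len] = cost
            return True
        return False


table = LeastCostTable(task_len=10)
print(table.admit(3, 0.5))  # first map of prefix length 3 is forwarded
print(table.admit(3, 0.7))  # a costlier map of the same length is pruned
```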



4.1.2 AnnealedLeastCostMap 

In the greedy pruning of higher cost partial maps, it is pos- 
sible that the mapping that would lead to the optimal so- 
lution is pruned while the allowed mapping does not meet 
the constraints in the later stages. One way to compromise 
between the greedy pruning and the unpruned exponential 
growth of mappings is to apply a kind of simulated an- 
nealing in the pruning process. A partial mapping of cost 
higher than the already observed minimum is allowed for 
extension with a probability and the probability diminishes 
exponentially with the growing prefix-length of the map- 
ping. This heuristic is hereafter denoted as AnnealedLeast- 
CostMap heuristic. Obviously, this approach increases the 
message complexity, with the hope that some of the non- 
minimal partial mappings would possibly lead to a better 
complete mapping. 
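
A minimal sketch of the annealed pruning decision is given below. The base probability p0, the decay constant and the function name are assumptions chosen for illustration; the text only fixes that the acceptance probability diminishes exponentially with the prefix-length.

```python
import math
import random


def annealed_admit(cost, best_cost, prefix_len, p0=0.5, decay=1.0, rng=random):
    """Decide whether a partial map is extended under annealed pruning.

    A map cheaper than the best one seen so far is always extended.
    A costlier map is extended with probability p0 * exp(-decay * prefix_len),
    where p0 and decay are hypothetical tuning parameters.
    """
    if cost < best_cost:
        return True
    probability = p0 * math.exp(-decay * prefix_len)
    return rng.random() < probability
```

Larger values of p0 or smaller values of decay let more non-minimal maps through, which raises the message count in exchange for a wider search.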



Algorithm 1 ProcessMap(m, T)

 1: Input:
 2: The current node executing the method is denoted as v. The
    sender of the message is u.
 3: $T = t_1, t_2, \ldots, t_{|T|}$ denotes the ordered set of
    components in the stream processing task. Each $t_i$ has
    an associated $C(i)$ denoting processing requirement.
    Each $(t_i, t_{i+1})$ has an associated $B(i, i+1)$ denoting
    the required bandwidth. max_null denotes the maximum
    number of empty hops allowed in the map. T is
    either found appended with the map message or from
    the stored state.
 4: m is the map message containing the mapping of the
    first j services on a series of server nodes. j is called
    the prefix-length of m.
 5: For any node u, $C_{av}(u)$ denotes the computational
    capacity of u. $S(u)$ denotes the set of service components
    served by u. For a pair of nodes u and v,
    $B_{av}(u, v)$ denotes available bandwidth in the (u, v)
    channel.
 6: if no state for T or pred is undefined then
 7:   store T from the message
 8:   create pred, count and knows
 9:   pred ← u, count ← 0, ∀ neighbor k, k ≠ u: knows(k) ← FALSE
10: else
11:   Send ack(REGULAR) to u
12: end if
13: for x = 0 to |T| − j − 1 do
14:   if (x = 0) or ($t_{j+x} \in S(v)$ and $C_{av}(v) > \sum_{i=1}^{x} C(j+i) + \sum_{i \le j,\ t_i \text{ mapped on } v} C(i)$) then
15:     $m_x$ ← map found by extending next x services in T on v
16:     if v is the data-delivery node and (j + x > |T|) then
17:       store $m_x$ in the list of feasible maps
18:     end if
19:   else
20:     break
21:   end if
22:   for each neighbor k of v do
23:     if ($B_{av}(v, k) > B(j+x, j+x+1)$) and ((x > 0) or (empty hops in m < max_null − 1)) then
24:       if knows(k) = FALSE then
25:         knows(k) ← TRUE
26:         Append T to $m_x$
27:       end if
28:       Send $m_x$ to k
29:       count ← count + 1
30:     end if
31:   end for
32: end for


Algorithm 2 processAck(isFinal)

ack message received from neighbor u
count ← count − 1
if count = 0 and pred is not invalid then
  Send ack(FINAL) to pred
  pred ← invalid
end if
if isFinal = FINAL then
  knows(u) ← FALSE
end if



4.1.3 RandomNeighbor 

Another way of restricting the message complexity is to extend
a partial map to a randomly chosen subset of k neighbors
instead of expanding to all of them. Higher values
of k increase the chance of getting the optimal solution.
The RandomNeighbor heuristic with k = 1 did not produce
results as good as LeastCostMap, although the number
of messages was reduced dramatically. Further investigation
may be done to determine a suitable value of k.
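
The neighbor selection step of RandomNeighbor can be sketched as below. The function name and the uniform sampling without replacement are assumptions made for illustration.

```python
import random


def choose_neighbors(neighbors, k, rng=random):
    """Pick at most k distinct neighbors uniformly at random."""
    neighbors = list(neighbors)
    if k >= len(neighbors):
        return neighbors
    return rng.sample(neighbors, k)


# With k = 1 each partial map is forwarded to a single random neighbor.
print(choose_neighbors(["n1", "n2", "n3", "n4"], 1))
```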

4.2 Performance of the Heuristics 

To choose one among the possible heuristics, we evaluated
them on an emulated network of nodes. We measured the
quality of the approximate solutions generated by the
heuristics as well as their message overheads. The network
topology was generated by the BRITE Internet topology
generator [26], using the Barabási-Albert algorithm [27].
This generates a power-law graph. The link bandwidths were
sampled from a truncated power-law distribution with
min = 10 Mbps and max = 1 Gbps. Computational capacities of
the nodes were randomly assigned from a distribution of
node-capacities of a volunteer computing project [28]. The
nodes were emulated as processes hooked to UDP ports in
LAN-connected computers. These virtual nodes communicated
among themselves using UDP packets. The network size was
varied from 30 to 120 nodes. The tasks for mapping consisted
of 10 components. The bandwidth and capacity requirements
of each task-component were sampled from a Normal
distribution with mean equal to 50% of the average link
and node capacity of the network, respectively.

First, we attempted to evaluate how close the solutions
generated by the heuristics are to the exact optimal solutions.
Because it is computationally expensive to run the
algorithm that gives the exact optimal solution, we devised
an algorithm that computes a lower bound of the optimal
solution. We relaxed the bandwidth constraints and transformed
the problem into finding an optimal-cost path in a
multi-stage graph. The first and last stages represent the
source and the terminal nodes. Each of the internal stages
has n vertices, representing the choice of any of the n
servers for the processing components of the tasks. Then
we compute the lowest cost path from the source to the
terminal vertex, subject to the node-capacity constraints only.
Ignoring the bandwidth constraints allows lower cost solutions
that are not feasible in the actual problem. All the
feasible solutions for the actual problem will be feasible in
the relaxed problem. So, the optimal solution of the relaxed
problem will be a lower bound on the optimal cost of
the actual problem. We computed the ratio of the cost of
heuristic generated solutions to this lower bound cost.
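
The relaxed problem can be illustrated with a stage-by-stage dynamic program. The sketch below is written under several assumptions: the cost is additive, stage_cost and hop_cost are hypothetical inputs, the source and terminal stages are treated as zero-cost and omitted, and capacity is checked per component only (the load of several components placed on one server is not accumulated). The last simplification relaxes the problem further, so the result is still a lower bound.

```python
import math


def relaxed_lower_bound(stage_cost, hop_cost):
    """Lowest-cost path through a multi-stage graph.

    stage_cost[s][v]: cost of placing component s on server v, or math.inf
                      when v cannot host it (hypothetical cost model).
    hop_cost[u][v]:   cost of moving the stream from server u to server v,
                      with bandwidth limits ignored (the relaxation).
    """
    n = len(hop_cost)
    # best[v] is the cheapest way to have processed the stages so far
    # with the latest component placed on server v.
    best = list(stage_cost[0])
    for s in range(1, len(stage_cost)):
        best = [
            min(best[u] + hop_cost[u][v] for u in range(n)) + stage_cost[s][v]
            for v in range(n)
        ]
    return min(best)


# Three components, two servers.
stage_cost = [[1, 5], [4, 1], [2, 2]]
hop_cost = [[0, 1], [1, 0]]
print(relaxed_lower_bound(stage_cost, hop_cost))  # 5
```

The running time of this sketch is proportional to the number of components times n squared, which is what makes the bound cheap to compute compared with the exact search.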

To assess the cost of executing the heuristics, we 
counted the total number of map messages exchanged 
among the nodes. Because arrival of each map message in- 
vokes the processing algorithm on the receiving node, the 
total computational cost is also proportional to the number 
of map messages. Although we did not evaluate the mes- 
sage complexity of the exact algorithm, we have compared 
the complexities of the heuristics, which helps to choose 
one heuristic over the others. 
Figure 7. Comparing three heuristics

Figure 7(a) shows that the heuristic derived solutions
are fairly close to the lower bound of the optimal solutions.
One can observe that both the LeastCostMap and
the AnnealedLeastCostMap yield solutions that are equally
close to the optimal solutions. The RandomNeighbor
heuristic does not produce good solutions, because the number
of feasible ways to expand the partial maps narrows
down very quickly here. In terms of the cost of computation
of the heuristics, we observe in Figure 7(b) that the number
of map messages needed to complete the mapping of a single
task-composition is much higher in the AnnealedLeastCostMap
heuristic than in the other two heuristics. This is because, for
the chosen parameter setting, the AnnealedLeastCostMap
extends many more of the alternative paths and mappings
than the LeastCostMap heuristic. Analyzing both
the results in Figure 7(a) and Figure 7(b), it may be concluded
that the additional message overhead due to the late
pruning in the AnnealedLeastCostMap is not worth its
gain in optimality. Finally, we chose the LeastCostMap
heuristic for our distributed stream processing platform.

4.3 Modifications for Bi-Modal Communication Links

So far, in the design of the decentralized mapping algo- 
rithm, we did not consider the presence of the opportunistic 
communication links. As mentioned in the system model 
in Section 2, each node is connected to the public Internet 
and can establish an end-to-end connection with any other 
node. The presence of these all-to-all links requires some
modifications in the ProcessMap procedure described in
Algorithm 1.

Because, with opportunistic links, all other nodes in the 
platform are neighbors in terms of connectivity, sending of 
extended maps to all neighbors in Line 28 of ProcessMap 
would be inefficient, although it would work. Instead, af- 
ter extending the mappings to all the dedicated link neigh- 
bors, the mappings may be extended to a small subset of 
the opportunistic-link-neighbors. To choose a subset, we 
assume that each node has an approximate knowledge of 
which node serves which service. We assume that there
exists a gossip mechanism to disseminate this knowledge.
Note that the set of services available at a node changes
much less frequently than individual task mapping
requests arrive. Moreover, this knowledge is used only
for minimizing the overhead, so its inaccuracy does little
harm other than missing some possible solutions.
Another point to note is that, with all-to-all connectivity,
there is no point in mapping a hop of the task-composition
on multiple hops of opportunistic links,
although a multi-hop dedicated connection is still preferable.

The final version of the ProcessMap algorithm that applies
the LeastCostMap heuristic and takes care of opportunistic
links is presented in Algorithm 3. The M(1 : |T|)
data structure (Line 5), which stores the costs of the
minimum-cost mappings among the already observed partial
maps, and the condition in Line 20 are added for the
LeastCostMap heuristic. The additional code in Lines 31-37
handles the extension of the mappings through opportunistic
links. Note that such extension is allowed only
when at least one task-component is mapped on the current
node (Line 31). Because it is not possible to allocate
end-to-end bandwidth in the opportunistic links, only the
uplink bandwidth is allocated. The end-to-end bandwidth
that a task actually gets is monitored and reactively
allocated in a continuous feedback loop, which we will discuss
in the next section.

Algorithm 3 ProcessMap2(u, m, T)

 1: Input: As described in Algorithm 1
 2: if no state for T or pred is undefined then
 3:   store T from the message
 4:   create pred, count and knows
 5:   create M(1 : |T|)
 6:   pred ← u, count ← 0, ∀ neighbor k, k ≠ u: knows(k) ← FALSE
 7:   ∀i: M(i) ← ∞
 8: else
 9:   Send ack(REGULAR) to u
10: end if
11: for x = 0 to |T| − j − 1 do
12:   if (x = 0) or ($t_{j+x} \in S(v)$ and $C_{av}(v) > \sum_{i=1}^{x} C(j+i) + \sum_{i \le j,\ t_i \text{ mapped on } v} C(i)$) then
13:     $m_x$ ← map found by extending next x services in T on v
14:     if v is the data-delivery node and (j + x > |T|) then
15:       store $m_x$ in the list of feasible maps
16:     end if
17:   else
18:     break
19:   end if
20:   if cost($m_x$) < M(|$m_x$|) then
21:     for each dedicated-link-neighbor k of v do
22:       if ($B_{av}(v, k) > B(j+x, j+x+1)$) and ((x > 0) or (empty hops in m < max_null − 1)) then
23:         if knows(k) = FALSE then
24:           knows(k) ← TRUE
25:           Append T to $m_x$
26:         end if
27:         Send $m_x$ to k
28:         count ← count + 1
29:       end if
30:     end for
31:     if x > 0 then
32:       for each node k such that k provides the service j + x + 1 do
33:         if available uplink bandwidth to the Internet > bandwidth need for service hop (j + x, j + x + 1) then
34:           Send $m_x$ to k
35:         end if
36:       end for
37:     end if
38:   end if
39: end for

To devise an appropriate cost metric for choosing the
best among alternative feasible maps, we considered the
following two factors: balancing the service workload
among the servers and minimizing the uncertainty of using
opportunistic links. The load-balance factor for a map
(or a partial map) is computed as an average of the server
load-factors (ratio of used capacity to total capacity) for
all the servers included in the map, and is always a number
between 0 and 1. A map with a lower load-balance factor
spreads the components of a task over different servers rather
than putting all of them on one, and chooses the
under-utilized servers. If two maps have almost the same
load-balance factor (they do not differ by more than 0.1 or 10%),
the one in which the number of hops (links connecting
the processing components) assigned to dedicated links is
higher is preferred. If that is also the same, the map with the
fewest hops through the public network is preferred.
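
The three-level preference rule can be sketched as a pairwise comparison. The names MapSummary, prefer and pick_best are hypothetical; the 0.1 tolerance is the one stated above. Because a tolerance-based comparison is not transitive, the outcome of a linear scan may depend on the order in which the maps are examined; the sketch does not attempt to resolve that.

```python
from dataclasses import dataclass


@dataclass
class MapSummary:
    load_factor: float   # average used/total capacity over servers in the map
    dedicated_hops: int  # task hops carried on dedicated links
    public_hops: int     # task hops carried over the public network


def prefer(a: MapSummary, b: MapSummary, tol: float = 0.1) -> MapSummary:
    """Return the preferred of two feasible maps."""
    if abs(a.load_factor - b.load_factor) > tol:
        return a if a.load_factor < b.load_factor else b
    if a.dedicated_hops != b.dedicated_hops:
        return a if a.dedicated_hops > b.dedicated_hops else b
    return a if a.public_hops <= b.public_hops else b


def pick_best(maps):
    """Scan the feasible maps once, keeping the preferred one."""
    best = maps[0]
    for candidate in maps[1:]:
        best = prefer(best, candidate)
    return best
```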

5 Adaptive Re-allocation of the Bi-modal Links

The dynamic link scheduler in each node is invoked pe- 
riodically at regular intervals. Based on current evaluation 
of locally observed performance, the scheduler re-allocates 
the locally available link resources among the competing 
tasks that are using this node. The overall policy of the 
scheduler is to prioritize the tasks for use of the network 
links, based on their deviation from target data rate and the 
price they would pay for the data processing service. 

The links that carry the stream between two data processing
servers can be of three different types: i) a direct
dedicated link, ii) a multi-hop dedicated link through one
or more forwarding nodes, and iii) an overlay link through the
public network. A mapping of a task may contain any
combination of these three types of links between the
processing nodes. Among them, the direct dedicated links are
the most preferred, because they provide a controlled and
stable data rate. A multi-hop dedicated link provides similar
control and stability, but it costs more (Section 2.4). The
third possibility is an overlay link through the public
network. The flow rate is variable over such links, but
there is no additional cost for sending data through them.
So, the nodes try to opportunistically use these links when
dedicated links are overloaded or not available.

Algorithm 4 is executed when the scheduler is invoked
at regular intervals. For allocation of the links, tasks are
grouped according to their next hop server node (Line 2).
While prioritizing among competing tasks for each group
(Lines 3-10), the scheduler tries to maximize the revenue
earning of the server and prefers the tasks marked with a
higher price per unit of processing. On the other hand, the
server tries to fulfill the rate requirement of each task,
because it gets penalized otherwise. Hence the scheduler
computes the priority of each task as a product of the
apportioned price and the data rate required in the next
scheduling epoch.

Algorithm 4 Link re-allocation algorithm

 1: Invoked for each node u periodically
 2: Group the tasks that are being processed in u by their next hop server v
 3: for each group v do
 4:   Compute the priority of each flow competing for a (u, v) link as:
 5:   priority ← budget per byte of processed data × bandwidth required to comply with the target rate
 6:   if any dedicated link (u, v) exists then
 7:     Assign the dedicated link to top priority flows until all capacity is used
 8:   end if
 9:   Collect all the unassigned flows
10: end for
11: for all the remaining flows do
12:   if the budget permits a k-hop (u, v) dedicated link, k > 1 then
13:     Launch a probe search and reserve a multi-hop dedicated path for the flow with maximum k hops
14:     Assign public network bandwidth for the flow temporarily
15:   else
16:     Assign public network bandwidth for the flow
17:   end if
18: end for

For each next hop group, the highest priority tasks get an
allocation from the direct dedicated link, if such a link exists
and capacity permits (Line 7). The tasks next in priority are
assigned multi-hop dedicated links (Line 13). The maximum
possible number of hops in such multi-hop links is restricted
by the apportioned price for that service according to the
task specification. The remaining tasks from all the groups
are allocated bandwidth from the opportunistic public
network links (Line 16).
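
One scheduling epoch of the re-allocation policy can be sketched as follows. The Flow record, the decision labels and the rule that a flow is skipped when it does not fit in the remaining dedicated capacity (no partial allocation) are assumptions made for illustration; the probe search for a multi-hop path is represented only by a label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    task_id: str
    next_hop: str
    price_per_byte: float  # apportioned budget per byte of processed data
    required_bw: float     # bandwidth needed to meet the target rate
    max_hops: int          # longest dedicated path the budget permits


def reallocate(flows, dedicated_capacity):
    """Return {task_id: 'dedicated' | 'multi-hop-probe' | 'public'}.

    dedicated_capacity maps a next-hop server to the capacity of the direct
    dedicated link towards it; servers without such a link are absent.
    """
    groups = defaultdict(list)
    for flow in flows:
        groups[flow.next_hop].append(flow)

    decision, leftover = {}, []
    for next_hop, group in groups.items():
        remaining = dedicated_capacity.get(next_hop, 0.0)
        # Highest priority first: price times required bandwidth.
        ranked = sorted(group, key=lambda f: f.price_per_byte * f.required_bw,
                        reverse=True)
        for flow in ranked:
            if flow.required_bw <= remaining:
                decision[flow.task_id] = "dedicated"
                remaining -= flow.required_bw
            else:
                leftover.append(flow)

    for flow in leftover:
        # A budget for more than one hop triggers a multi-hop probe (the flow
        # uses the public network meanwhile); otherwise it stays public.
        decision[flow.task_id] = "multi-hop-probe" if flow.max_hops > 1 else "public"
    return decision
```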

5.1 Is Adaptive Re-Allocation Necessary?

So far, we have argued that due to the inconsistent behavior
of the opportunistic links, it is necessary to reallocate
the link resources periodically in a feedback loop.
Here we assess the necessity of such re-allocation
quantitatively.

The main intuition behind introducing dynamic re-allocation
is that a data-stream that goes through the public
network suffers from variability and lags behind the
target rate, whereas a stream that uses dedicated links
throughout does not lag behind the target at all. Dynamic
scheduling introduces fairness across all the tasks. So if
link assignment is done dynamically, it is expected to improve
the utilization of the resources and increase the overall
work-throughput of the system.

For the evaluation we used a 100-node simulated stream
processing platform. Details of the simulation set-up are
described later in Section 6.1. We fed the same workload to
two system set-ups, both having bi-modal communication
networks. In one, we disabled the adaptive re-allocation of
links and let the tasks complete with the initial assignment
of links and nodes. The adaptive re-allocation was enabled in
the other. All other system parameters were the same for
both set-ups. From Figure 8(a) we observe that the overall
system throughput increases with adaptive re-allocation,
an indication of a higher task acceptance ratio and higher
utilization of the system resources. Figure 8(b) demonstrates
that adaptive re-allocation results in much higher utilization
of the dedicated links. CPU utilization remains unchanged
(not shown), because the dynamic re-allocation
does not alter the node assignments. Another rationale
behind re-allocation is to increase fairness and improve
compliance with the target delivery rate. Figure 8(c) shows
that, irrespective of workload, the adaptive re-allocation
decreases the deviation from the specified target rate.

6 Performance Evaluation and Discussion 

6.1 Simulation Model 

We constructed a simulation model of the distributed
stream processing platform according to the architecture
and algorithms presented in Sections 2 and 3, respectively.
The model was built on the Java-based simulation engine
JiST [29].

Each of the servers in distributed locations is connected
to the public Internet. Although each server has a certain
uplink and downlink bandwidth, the data rate over a connection
that goes through the public network faces temporal
variation. We use the statistics presented by Wallerich
and Feldmann [30] to model the temporal variability of
the end-to-end capacity of a path through the public network.
According to their data, collected from packet-level traces
from core routers of two major ISPs over 24 hours, the
logarithm of the ratio of the observed transient flow rate
to the mean flow rate over a long period is almost Normally
distributed. In our simulations, all flows on the public
network are perturbed every 10 milliseconds according to
this model. With the allocated bandwidth as the mean rate
and the standard deviation of the log-ratio set at 1, in 95%
of the cases the observed bandwidth remains between one
fourth ($2^{-2\sigma}$) and four times ($2^{2\sigma}$) the allocated or mean
bandwidth. The bandwidth of each last-mile connection (uplink
and downlink) is randomly assigned between 1 Mbps
and 2 Mbps.
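
The perturbation model can be sketched in a few lines. The base-2 logarithm is inferred from the one-fourth and four-times bounds quoted above; the zero mean of the log-ratio and the function name are assumptions made for illustration.

```python
import random


def perturbed_bandwidth(allocated_mbps, sigma=1.0, rng=random):
    """Sample an observed rate for a flow on the public network.

    The base-2 logarithm of observed/allocated is drawn from a Normal
    distribution with mean 0 (an assumption) and standard deviation sigma.
    """
    return allocated_mbps * 2.0 ** rng.gauss(0.0, sigma)


# Roughly 95% of the samples fall between one fourth and four times
# the allocated rate when sigma = 1.
rng = random.Random(7)
samples = [perturbed_bandwidth(1.0, rng=rng) for _ in range(20000)]
inside = sum(0.25 <= s <= 4.0 for s in samples) / len(samples)
print(round(inside, 2))
```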

In addition to the public network links, the servers are 
interconnected through dedicated links (which may be 
leased lines or privately installed links). For the dedi- 
cated network, we assume a preferential connectivity based 
network growth model similar to the one proposed by 
Barabási et al. [27]. The basic premise here is that when
a server attempts to establish a dedicated link, it does so 
preferably with the most connected server. This eventually 
results in a power law degree distribution in the network. 
We assumed that server CPU capacity is proportional to 
the number of dedicated links it has. The variety of ser- 
vices that a server can host is also proportional to the node 
degree or capacity. The dedicated links have much higher 
bandwidth than the network links connecting a node to the 
public network. Their bandwidths were randomly assigned 
between 1 Mbps and 10 Mbps and the propagation delays 
were assumed to be between 1 and 10 milliseconds. The 
propagation delay of an end-to-end connection through the 
public network was much higher and assumed to be be- 
tween 10 and 100 milliseconds. 

Unless otherwise mentioned, we assumed the platform
to have 100 server nodes and 99 dedicated links interconnecting
them. There were 25 different types of services.
As the service variety is proportional to the node degree, a
node having d dedicated links was assumed to host 1 + d
different types of services (one added for the public network
link). Server CPU capacity was set such that a server can
execute k instances of each service concurrently, according to
the mean data delivery rate. We set k = 2. For the task
workload, each task is assumed to have 10 service components,
randomly chosen from the 25 different types of service.
The mean data delivery rate was 1 Mbps and the total amount of
data to be processed from the source was 100 MB on average.
Each data point in the results shown below is an
average of 100 observations from different experiments on
randomly generated networks with the specified parameters.
For each experiment, a synthetic workload trace containing
500 stream processing tasks was generated. The task
arrival process is assumed to be Poisson, with the arrival
rate varying across the experiments. If not mentioned
otherwise, the default arrival rate was 60 tasks per hour.
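
A synthetic trace with these default parameters could be generated along the following lines. The function name, the tuple layout and the choice of drawing service types with replacement are assumptions made for illustration; data volumes and delivery rates are left out.

```python
import random


def generate_workload(num_tasks=500, rate_per_hour=60.0, components=10,
                      service_types=25, rng=random):
    """Return a list of (arrival_time_seconds, [service ids]) tuples."""
    clock = 0.0
    trace = []
    for _ in range(num_tasks):
        # Poisson arrivals: exponential inter-arrival times with a mean of
        # 3600 / rate_per_hour seconds.
        clock += rng.expovariate(rate_per_hour / 3600.0)
        services = [rng.randrange(service_types) for _ in range(components)]
        trace.append((clock, services))
    return trace


trace = generate_workload(rng=random.Random(1))
print(len(trace), len(trace[0][1]))
```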

6.2 Benefits of Combining Opportunistic and Dedicated Resources

We performed several sets of experiments to evaluate the
benefits of using bi-modal networks for stream processing
tasks. In the experiments, we compare three possible settings:
i) a network with dedicated links only, ii) the public
network only, and iii) a network that combines both.

Figure 8. Effect of dynamic scheduling

The first argument in favor of a bi-modal network for stream
processing is that by combining the public network with
dedicated links, the system achieves much higher work
throughput at the same cost. To examine this, we fed similar
workload traces under the same arrival rates to two system
set-ups, one with a network based on dedicated links only
and the other using the combination of dedicated links and
the public network. From Figure 9(b) we observe that for the
same workload, if the platform uses dedicated links only,
it needs more than 120 links to get a 50% acceptance ratio,
whereas the same acceptance ratio can be obtained with
only 50 dedicated links if the public network is utilized in
conjunction. Similar evidence in Figure 9(a) shows that
inclusion of the public network helps to achieve the same overall
system throughput with far fewer dedicated link
installations.

The next argument is that utilization of the privately deployed
expensive dedicated resources such as servers and
dedicated links is increased if the inexpensive public network
is used in conjunction. From Figure 9(c) we observe that
when a combination of dedicated links and the public network
is used, the server utilization is higher than the sum
of the utilizations of the cases using a single type of network link.

Figures 9(e) and 9(f) show further evidence of higher
return on investment. In Figure 9(e), we observe that the
utilization of dedicated links becomes consistently higher
across a wide range of loading scenarios if the public network
is used in combination. The lower utilization in the case
of a dedicated-link-only network results from the fact that
the platform has rejected many task requests that would
have been feasible with the augmentation of the public
resources. Figure 9(f) shows the variation of the utilization of
the dedicated links with the number of dedicated links. We
observe that the difference in utilization diminishes as the
number of installed links increases. This is because when
there is a sufficient number of dedicated links to carry the
required traffic of all the tasks, the public resources are not
used at all, and the bi-modal system becomes equivalent to
a dedicated-link-only system. In both cases, utilization of
the links keeps decreasing as more and more links are
added, because the workload is held constant.



The discussion above highlighted the benefits of using the
public network towards improving the utilization of dedicated
server and link resources (i.e., increases in return on
investment). Next we investigate how the bi-modal network
helps the stream processing platform to improve its
compliance with the service contracts it has with individual
tasks. We measure the compliance of the stream
processing platform as follows. Each task request specifies
a time window T that is used to monitor the delivery
rate. We measured the deviation from the required rate as

$\sum_{\text{over all windows}} \frac{|B - \hat{B}|}{B}$, where $B$ is the desired rate and
$\hat{B}$ is the observed rate of delivery. In Figure 9(g), we observe
that the use of dedicated links brings the percent deviation
down to between 10% and 20% from above 50%. In
this case the number of installed dedicated links was just
enough to make a spanning tree of the nodes, i.e. N - 1
links for N nodes. Note that deviation is counted on the
accepted jobs only. So, even though the deviation is almost
zero for a dedicated-link-only network, we have seen that
such a network is unable to accept enough jobs to fully
utilize the resources. In Figure 9(h), we observe that the
deviation in the bi-modal system gets closer to zero as more
and more dedicated links are added to the network. However,
beyond a certain number of links (125 in this particular
experiment), the improvement is marginal.



When we use a combination of dedicated and public
links, it is expected that the completion time of each task
will be slightly longer than in a system with only
dedicated links, due to the variability in the public network.
Nevertheless, using the combination keeps the increase
small compared to the case where only the
public network is available. In Figure 9(i), we observe a
10-20% increase in the execution time in the bi-modal
system, whereas the execution time would be 200-300%
higher in a public-network-only system.
Figure 9. Comparing bi-modal and uni-modal networks



7 Related Work 

Although there is a vast body of literature on resource
management in cluster, Grid or peer-to-peer hosting platforms,
relatively few works propose the combined use of
dedicated and public resources.
In [31], Kenyon et al. argued, based on mathematical
analysis, that commercially valuable quality-assured
services can be generated from harvested public
computing resources if they are augmented with a small
number of dedicated computers. With simple models
for available periods of harvested cycles, their work has
measured the amount of dedicated resources necessary to
achieve some stochastic quality assurance from the platform.
However, they did not study how a bi-modal platform
would perform in the presence of clients with different
service level agreements and how to engineer the
scheduling policies to maximize the adherence to these
agreements.

Recently, in [32], Das et al. have proposed the use of 
dedicated streaming servers along with BitTorrent, to pro- 
vide streaming services with commercially valuable qual- 
ity assurances while maintaining the self scaling property 
of BitTorrent platform. With analytical models of Bit- 
Torrent and dedicated content servers they have demon- 
strated how guaranteed download time can be achieved 
through augmentation of these platforms. However, their 
proposal does not include actual protocols that can be used 
to achieve these performance improvements. 

Architectures and resource management schemes for
distributed stream processing platforms have been studied
by many research groups from distributed databases, sensor
networks, and multimedia streaming [1, 2, 6, 3]. In
database and sensor network research, the major focus was
on placing the query operators on nodes inside the network
that carries the data stream from the source to the viewer [7].
In multimedia streaming problems, similar requirements
arise when we need to perform a series of on-line operations
such as trans-coding or embedding on one or more
multimedia streams and these services are provided by
servers in distributed locations. In both cases, the main
problem is to allocate the node resources where certain
processing needs to be performed, along with the network
bandwidth that will carry the data stream through these
nodes.

Finding the optimal solution to this resource allocation 
problem is inherently complex. Several heuristics have 
been proposed in the literature to obtain near-optimal
solutions. Recursive partitioning of the network of
computing nodes has been proposed in [15] and [4] to map
the stream processing operators on a hierarchy of
node-groups. These works have demonstrated that such
distributed allocation of resources for the query operators
provides better response time and better tolerance to
network perturbations compared to planning the mapping at a
centralized location.

In [33] and [8], the service requirements for multi-step
processing of multimedia streams, defined in terms of
service composition graphs, have been mapped to an overlay
network of servers after pruning the whole resource
network into a subset of compatible resources. The mapping
is performed subject to some end-to-end quality
constraints, but the CPU requirements of the individual
service components are not considered. Liang and Nahrstedt
in [12] have proposed solutions to the mapping problem
where both node capacity requirements and bandwidth
requirements are fulfilled. However, one of the assumptions
made by Liang and Nahrstedt was that the optimization
algorithm is executed in a single node and the complete
state of the resource network is available to that node
before execution. In a large-scale dynamic network this
assumption is hard to realize. If we assume that each node
in the resource network is aware of the state of its
immediate neighborhood only, we need to compute the
solution using a distributed algorithm such as ours.
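
To make the joint constraint concrete, the following is a minimal
sketch of a feasibility check for one candidate mapping, in which
both node capacity and link bandwidth must be respected. It is not
the algorithm of [12] or of this paper. The function name, the
dictionary layout, and the assumption that every pair of
communicating operators is served by a single direct overlay link
are ours, chosen only for illustration.

```python
def mapping_is_feasible(op_cpu, flows, placement, node_cpu, link_bw):
    """Check one candidate mapping against node and link capacities.

    All names and structures here are hypothetical illustrations:
      op_cpu:    {operator: CPU demand}
      flows:     {(src_operator, dst_operator): bandwidth demand}
      placement: {operator: node}
      node_cpu:  {node: available CPU}
      link_bw:   {(node_a, node_b): available bandwidth}, directed
    """
    # Accumulate the CPU demand that the mapping places on each node.
    cpu_used = {}
    for op, demand in op_cpu.items():
        node = placement[op]
        cpu_used[node] = cpu_used.get(node, 0.0) + demand
    if any(used > node_cpu.get(node, 0.0) for node, used in cpu_used.items()):
        return False

    # Accumulate the bandwidth demand on each directed link. Operators
    # placed on the same node are assumed to use no network bandwidth.
    bw_used = {}
    for (src, dst), demand in flows.items():
        a, b = placement[src], placement[dst]
        if a == b:
            continue
        bw_used[(a, b)] = bw_used.get((a, b), 0.0) + demand
    # A link missing from link_bw is treated as having zero capacity.
    return all(used <= link_bw.get(link, 0.0) for link, used in bw_used.items())
```

A centralized planner could evaluate such a check only with the
complete `node_cpu` and `link_bw` tables in hand. A node that knows
only its immediate neighborhood holds just a fragment of them, which
is why the mapping has to be computed cooperatively.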

In all of the above-mentioned works, the operator nodes
are assumed to be interconnected through an
application-dependent overlay network using the Internet as
underlay. In [34], Gu and Nahrstedt presented a service
overlay network for multimedia stream processing, where
they have shown that dynamic re-allocation of the operator
nodes provides better compliance with the service contracts
in terms of service availability and response time.
However, none of these works has proposed the use of
dedicated links in conjunction with an IP overlay network
for improving adherence to the service contracts.

8 Conclusion 

In this paper, we investigated the resource management
problem with regard to data stream processing tasks. In
particular, we examined how a hybrid platform made up of
dedicated server resources and bi-modal network resources
(dedicated plus public) can be used for this class of
applications. From the simulation-based investigations, we
were able to make several interesting observations. First,
bi-modal networks can improve dedicated resource
utilization (server plus dedicated network links). This
means a higher return on investment can be obtained by
engaging the bi-modal network. Second, the overall system
is able to admit and process tasks at a higher rate
compared to system configurations that do not leverage a
bi-modal network. Because the public network is engaged at
zero or very low cost, this improvement in throughput can
result in significant economic gain for institutions that
perform data stream processing workloads. Third, the
engagement of the bi-modal network comes at a slight
overhead that adds small delays to stream processing tasks.
Compared to public-only networks, the delay introduced by
the bi-modal network is almost negligible. Fourth, dynamic
rescheduling is essential to cope with varying network
conditions, particularly in the public network. The
dynamic rescheduling algorithm switches the flows
according to the recomputed priority values to achieve the
best service level compliance.

In summary, our study highlights the benefits of the
bi-modal architecture for compute- and network-intensive
applications. Moreover, it provides simple distributed
algorithms that allow the effective utilization of such a
platform for data stream processing applications. Deploying
the distributed resource management framework in an actual
prototype for data stream mapping is a possible direction
for future work.